Tracked 3D ultrasound and deep neural network-based thyroid segmentation reduce interobserver variability in thyroid volumetry

Thyroid volumetry is crucial in the diagnosis, treatment, and monitoring of thyroid diseases. However, conventional thyroid volumetry with 2D ultrasound is highly operator-dependent. This study compares 2D and tracked 3D ultrasound with an automatic thyroid segmentation based on a deep neural network regarding inter- and intraobserver variability, time, and accuracy. Volume reference was MRI. 28 healthy volunteers (24—50 a) were scanned with 2D and 3D ultrasound (and by MRI) by three physicians (MD 1, 2, 3) with different experience levels (6, 4, and 1 a). In the 2D scans, the thyroid lobe volumes were calculated with the ellipsoid formula. A convolutional deep neural network (CNN) automatically segmented the 3D thyroid lobes. 26, 6, and 6 random lobe scans were used for training, validation, and testing, respectively. On MRI (T1 VIBE sequence) the thyroid was manually segmented by an experienced MD. MRI thyroid volumes ranged from 2.8 to 16.7ml (mean 7.4, SD 3.05). The CNN was trained to obtain an average Dice score of 0.94. The interobserver variability comparing two MDs showed mean differences for 2D and 3D respectively of 0.58 to 0.52ml (MD1 vs. 2), −1.33 to −0.17ml (MD1 vs. 3) and −1.89 to −0.70ml (MD2 vs. 3). Paired samples t-tests showed significant differences for 2D (p = .140, p = .002 and p = .002) and none for 3D (p = .176, p = .722 and p = .057). Intraobsever variability was similar for 2D and 3D ultrasound. Comparison of ultrasound volumes and MRI volumes showed a significant difference for the 2D volumetry of all MDs (p = .002, p = .009, p <.001), and no significant difference for 3D ultrasound (p = .292, p = .686, p = 0.091). Acquisition time was significantly shorter for 3D ultrasound. Tracked 3D ultrasound combined with a CNN segmentation significantly reduces interobserver variability in thyroid volumetry and increases the accuracy of the measurements with shorter acquisition times.


Introduction
Exact thyroid volumetry plays an essential role in monitoring and treatment of many thyroid diseases, especially for the radioiodine therapy (RIT) in hyperthyroidism: The needed activity, commonly calculated using Marinelli's formula, is proportional to the thyroid mass [1]. The latter is obtained using the thyroid volume measurements from two-dimensional ultrasound (2D US) using the ellipsoid formula and a correction factor [2]. The probability of a second treatment after treatment failure is dependent on the dose concept, thyroid volume, and achieved organ dose [3]. Therefore, underestimating the thyroid volume could lead to an underestimated administered dose and thus to the need for a second RIT. On the other hand, overdosage can lead to hypothyroidism and radiation-induced malignancies [4]. Volumetry is also important in diagnosing different thyroid diseases. For example, the thyroid volume is considered when interpreting altered values of TSH (thyroid-stimulating hormone). Currently, 2D US is the standard procedure for thyroid volumetry despite its high inter-and intraobserver variability. Three-dimensional (3D) US was first introduced in the 1970s, and it is believed to be superior to 2D US in terms of user dependency [5]. Similarly, the automatic compounding of 2D US sequences to generate 3D US has been shown to reduce variability [6]. However, the potential advantages of 3D US for volumetry only can be exploited clinically in combination with automatic segmentation. Only in this way 3D US can keep up the speed and ease of use of 2D US employing the ellipsoid formula. Andermann et al. [7] compared volume measurements in 2D and 3D US on healthy volunteers. In the 3D case, the volume was estimated by a multiplanar approximation. However, this 3D approach does not use machine learning and cannot be easily automatized. In another work, Chang et al. [8] developed a thyroid segmentation algorithm for US images based on a radial basis function neural network and a region growing method for shape recovery. Afterwards, a population-based stochastic optimization technique was used to estimate the thyroid volume. No 3D compounding was used. Poudel et al. [9] compared different non-automatic and machine-learning-based segmentation algorithms for freehand 3D US of healthy volunteers. In this work, automatic approaches show better results compared to non-automatic ones. Furthermore, the same group around Friebe et al. analyzed several machine learning and image enhancing methods for thyroid segmentation in US [10][11][12][13]. These works do not evaluate, yet, the impact of intraand inter-observer variability on volumetry. Another automatic segmentation has been developed by Webb et al. [14]. They created a network for the segmentation of nodules, thyroid, and cysts on US thyroid cineclips. Their work, however, focused on the network development and its accuracy with respect to manually segmented 2D ground truth.
In contrast, we present an analysis on volumetry and variability in a realistic setup using 3D compounded ultrasound volumes acquired with a tracked ultrasound system. Analyzing intraand interobserver variability in thyroid volumetry and comparing 2D and 3D US in this task is a fairly well-analyzed topic in literature [5,6,[15][16][17][18][19][20][21][22][23]. However, the combination of 3D US and machine learning for volumetry currently lacks exploration. In this paper, we present a user dependency study on thyroid volumetry with 2D and 3D US in employing a fully automatic segmentation of thyroid lobes by a deep neural network. We aim to analyze whether the hypothesis that 3D US imaging combined with this deep neural network is superior to conventional 2D US in terms of accuracy and variability of thyroid volume estimations is valid. Intraand interobserver variability was calculated for both modalities and then compared with each other. The accuracy was estimated by comparing both US methods to Magnetic Resonance Imaging (MRI). In this paper, our contributions are: • We introduce a fully automatic method for 3D volumetry of thyroid lobes using tracked 3D US.
• We comprehensively evaluate our method compared to 2D US by analyzing intra-and interobserver variability and accuracy by comparing to MRI.
• We evaluate the acquisition time performance of our method compared to 2D US.
• We provide a novel tracked 3D US thyroid dataset as the first bigger and labelled open source dataset that will become publicly available upon acceptance.

Materials
This study was conducted in the scope of an Institutional Review Board Approval from the Ethical Commission of the Technical University of Munich (approved on April 1st, 2020; reference number 244/19 S). All volunteers gave written informed consent to the US and MRI scans. Eligible participants had to be anamnestically healthy and at least 18 years old and were recruited from the hospital and research lab environment. Exclusion criteria for possible candidates were subtotal or total thyroidectomy, distinct postoperative or inherent changes in the neck anatomy and MRI contraindications (i.e. claustrophobia, neurostimulators, pacemaker, cochlear implants, medication pumps and shrapnels). Twenty-eight healthy volunteers, aged 23 to 50, eighteen males and ten females, participated in the study. One high-resolution T1 VIBE (Volumetric interpolated breath-hold examination) MRI scan of the neck was acquired with a Biograph mMR machine (Siemens Healthineers AG, Erlangen, Germany) with a field strength of 3 T. The resolution of the MRI image was 0.625x0.625x1.00 mm3, and the field of view 320x320x80 mm (Fig 1). The accuracy of the MRI segmentation was evaluated with a thyroid phantom (thyroid ultrasound training phantom, model 074, CIRS). According to the manufacturer, the thyroid volume of each phantom has slight variations of +/-3 to 5 cc but is close to 30 cc. From the MRI scan segmentation, we retrieved a volume of 26.33 ml. US images were obtained with an ACUSON NX3 machine (Siemens Healthineers AG, Erlangen, Germany) using a linear 12 MHz VF12-4 probe (Fig 2). The US images were acquired with a resolution of 668 × 599 and a spacing of 0.0753 × 0.0751mm. For the 3D US, electromagnetic tracking, and video acquisition, a PIUR tUS system was used (piur imaging GmbH, Vienna, Austria). This device employs two 6D tracking sensors, operating at a frequency of 80 Hz with accuracy in position of 1.40 mm RMS and in angle of 0.50˚RMS. Video sequences were acquired with 89 fps. 3D US probes, which consist of matrix arrays, can also be used to obtain US volumes. However, these probes are not widely available, rather bulky, and often do not cover the entire thyroid making a compounding necessary. The image processing (3D US compounding using image data and tracking information) and thyroid segmentation was done using ImFusion Suite Version 2.9.7 (ImFusion GmbH, Munich, Germany). PyTorch was used for the implementation of a deep neural network for automatic 3D US thyroid segmentation and Tensorflow was used for the evaluation of the network. The model was trained on an NVIDIA Titan V 12 GB HBM2 using Polyaxon to schedule different runs (https://polyaxon. com/). The training time was around 10 hours in average. The inference on new volumes however took only few seconds. The segmentation pipeline was implemented and trained by our team and differed from the one offered by piur imaging in their latest release. We did not use piur's segmentation tool for this study to be able to modify the segmentation pipeline for our project and to be able to provide this as work open-source for the research community.

General approach
Three 2D and three 3D US scans for each volunteer were acquired by three physicians (MD1, MD2, and MD3) with different levels of experience (6, 4, and 1 years). The ellipsoid formula (correction factor 0.48) was applied to estimate the thyroid volumes from the 2D US scans. No isthmus correction was included. To acquire a volumetric image for a 3D US, the physician slowly sweeps the transducer over both lobes (one sweep per lobe). These sweeps are automatically compounded into a 3D US stack. The 3D US thyroid volumes were calculated based on the automatic segmentation. The resolution of the segmentation mask matched the one of the US image. As a result, volume calculation succeeded by counting all voxels labeled as thyroid.
The workflow for the 3D volumetry task can be seen in Fig 3. The thyroid volumes, derived after manual segmentation of the MRI scans by an experienced MD (8 years, cross-sectional imaging), were used as reference.

Deep neural network for automatic segmentation
For the automatic segmentation of the 3D US scans, a fully convolutional deep neural network developed by Roy et al. [24] named QuickNAT was used (Fig 4). The structure is similar to the U-net [25], although it can work with smaller datasets as well. Segmentation was done in 2D slice per slice of the compounded 3D volume. QuickNAT consists of dense blocks of four encoders and four decoders with a bottleneck layer in between, as depicted in Fig 4. Each dense block consists of three convolutional layers, two of 5x5 and one of 1x1 kernel size, preceded by a batch normalization layer and a Rectifier Linear Unit (ReLU). The bottleneck block of a 5x5 convolutional layer, including a batch normalization layer, restricts information flow in between the encoder and decoder. There are skip connections between the corresponding encoder and decoder blocks to provide encoder feature information to the decoders directly and improve training through a path of gradient flow to the deeper layers. Each encoder is followed by a 2x2 max-pooling block, which reduces the spatial dimension of the feature maps. To achieve better segmentation for even smaller structures, there are un-pooling layers between the decoder blocks as the indices corresponding to the maximum activations are passed to the decoders. The final layer is a classifier block with softmax, which is a convolutional layer of 1x1 kernel size. It maps the input to a feature map with a given number of channels, two in this case, thyroid and background. Subsequently, the softmax layer transforms the channels into probability maps for each class. Due to the size of our dataset, we evaluated a smaller version of the network with one less encoder and decoder. However, this structural change did not show any improvement in the results. The loss function of the network is a combination of Dice and cross-entropy losses. 26, 6, and 6 random lobe scans were used to train the network, the validation, and testing, respectively. The thyroid in these 3D US scans was manually segmented, slice by slice, by an experienced MD (8 years, cross-sectional imaging). Both lobes of the same volunteer were segmented manually which provided a total of 17.664 slices for training (i.e. in average 552 slices per lobe). This total was split into a training set of 14.352 (13 patients) and a validation set of 3.312 images (3 patients). The original architecture of QuickNAT was kept unchanged. A dropout of 0.5, as well as data augmentation, namely random vertical flips and elastic deformations [26], were applied to prevent overfitting. The weights on the Dice and cross-entropy losses were kept equal. Edge weights were used to improve the contour of the segmentation. The number of training epochs was chosen during preliminary experiments, where the Dice loss converged at 20 epochs or earlier. The network was trained with a batch size of 4 at a learning rate of 1e-5. Dice score amounted to 0.95, 0.94, and 0.83 for training, validation, and test sets. The model was applied to the remaining data (13 volunteers), and the segmentations were used for the volume evaluations (Fig 5).

Statistics and evaluation
From the complete data set, scans from 16 volunteers were used for training and validation. Thus, the network was trained on images that were segmented by the same network in a later step for the volumetry estimation. We, therefore, split the evaluation into two groups, one including scans that were fed to the network during training (15 volunteers) (Group 1) and the

PLOS ONE
other with only unseen data (13 volunteers) (Group 2) to be able to detect a potential bias. To estimate the intraobserver variability of the acquisitions, we used a formula from Lyshchik et al. [5]. A t-test was performed with 2D, and 3D mean values to decide on whether both samples are significantly different. Additionally, we estimated the intraobserver variability as originally proposed by Choi et al. [18]. We examined the differences between two volumes of the same measurement sequence on a Bland-Altman plot [27]. We show the mean and the 95 limit of agreement (±1.96 � SD). The interobserver variability was estimated by applying the method proposed by Choi et al. [18]. Here, the differences between volume measurements for each combination of two physicians were plotted in a Bland-Altman plot (Fig 7). Further, we applied paired samples t-tests to evaluate the difference between the mean volume measurements for each physician combination of two. In our case the objective was to determine whether the mean difference between the different measurements of the same volunteer is zero. We chose a significance level of α = 0.05 for all tests in both groups. To compare 2D and 3D US volume measurements to MRI, we applied paired samples t-tests between MRI the first

PLOS ONE
Tracked 3D ultrasound and segmentation reduce variability in thyroid volumetry scan sequence from each MD with the same significance level resulting in the same critical values per group.

Results
We will present all results for Group 1 and Group 2 separately.

Intraobserver variability
Intraobserver variability is summarized in Table 1. Fig 6 depicts the Bland-Altman plots for the intraobserver variability of each MD with the 2D data on the left and the 3D data on the right for Group 2. Significant differences were found comparing the sets from MD2 and MD3 in both Groups 1 and 2.

Interobserver variability
The interobserver variability can be seen in Table 2 for both groups. The Bland-Altman plots regarding the interobserver variability for Group 2 are depicted in Fig 7.

Volumetry compared to MRI data
For the US volume, the first scan series per MD was chosen and compared to the MRI volume. Table 3 summarizes these comparisons. The mean MRI volume overall volunteers of Group 1 was 7.73ml (SD = 2.71), while for Group 2, it was 7.00ml (SD = 3.48).

Time
Time needed for the 2D scans was on average 58.8 sec (SD 11.05 sec) for MD1, 55.0 sec (SD 8.8 sec) for MD2 and for 35.7 sec (SD 5.8 sec) MD3, while the 3D scans were performed on average in 26.0 sec (SD 5.5 sec) by MD1, 21.1 sec (SD 5.9 sec) by MD2 and in 21.6 sec (SD 5.2 sec) by MD3.

Comparison with state-of-the-art
Intra-and interobserver variability in thyroid volumetry is a fairly well analysed problem. Vulpoi et al. had three MDs examine 30 children with 2D US. One physician scanned 25 patients twice to gather data on intraobserver variability, yielding 6.29% (SD = 6.12%). The interobserver difference was 9.51% (SD = 8.8%) [16]. Lee et al. studied intraobserver and interobserver variability between doctors with different experience levels (10 years and 6 months) on 122 nodules in 73 volunteers. The two MDs had a similar intraobserver variability -11.4% (SD = 10.5%) and 11.3% (SD = 8.5%). The interobserver variability amounted to 15.3% (SD = 16.6%) [17]. In the study of Choi et al. the thyroid nodule volume of 73 patients with 85 Table 1 . 1 and 2). The last column shows whether a significant difference exists between the averaged 2D and 3D volume estimation sets, being the significant different p-values in bold. nodules was measured by two MDs with different levels of experience twice. The thyroid nodules were divided into two groups: nodules with a diameter smaller than 2cm and those with a diameter bigger than 2cm. In the first group the interobserver variability was 7.0% (SD = 4.9%) and -5.1% (SD = 3.6%) in the second group [17,18]. Our study focused on the total thyroid volume. In comparison to the results reported by

PLOS ONE
Tracked 3D ultrasound and segmentation reduce variability in thyroid volumetry nodules three times, and delineated their thyroids. The delineated thyroids were controlled by a second expert. The second study included 47 children prior to thyroidectomy. Here, the thyroid weights after surgery were used as ground truth. The nodule volumes were calculated with the ellipsoid formula for 2D and multiplanar volume approximation, respectively manual planimetry, for the 3D scans. Both studies showed very similar results. 3D US showed a lower user dependency than 2D US, had a more accurate thyroid nodule volumetry, was less dependent on nodule size and performed better on irregular nodule outlines. In the first study the intraobserver variability was 16.1% (SD = 0.7%) for 2D US and 5.9% (SD = 0.3%) for 3D US. The accuracy for 2D US was 15.9% and 2.8% for 3D US. In the second study the intraobserver variability amounted to 14.4% (SD = 1.9%) in 2D US and to 3.36% (SD = 0.25%) for 3D US. The accuracy amounted to 15.3% for 2D US and 5.2% for 3D US. The second study showed no significant variation between the thyroid weight and both US modalities [5,20]. One limitation of both their studies is that only one physician acquired the data. In our study intraobserver variability for 2D US yielded similar results to Lyshchik et al. Our results for 3D US however are worse. We speculate that the reason for this could be either (a) our network performance in segmenting the thyroid lobes, (b) breathing or movement distorsions as we are using tracked ultrasound instead of a two-dimensional array ultrasound or (c) tracking errors.
With respect to MRI, Reinartz et al. compared 2D US thyroid volumetry to MRI (T1 FFE). Three different experienced MDs performed US scans on patients prior to RIT. The thyroid volumes were calculated by using the ellipsoid formula for the 2D US measurements and by manual segmentation. A significant difference of 22.7% (10.4ml), between the MRI-ROI and mean 2D US measurements was found. No significant interobserver variability among the MDs could be determined [15]. This finding is in contradiction with our results. In our case the MRI volumes were on average 21.3% smaller than the ones obtained from 2D US and 4.99% smaller than the ones obtained from 3D US. This may have to do with the fact that different MRI sequences, resolutions or sampling were used, but most likely by the criteria used for segmentation [23]. One study comparing 3D US to CT by Licht et al showed no relevant differences [28]. Rogers et al. analyzed length, diameters and volume measurements of four arteries from a pig using 3D tUS, B-Mode US, CT and MRI. As validation, water immersion technique was used. B-Mode US had the largest error in volume estimation, tUS the smallest. The mean error in volume estimation was −0.54 ± 0, 62ml for B-Mode, −0.06 ± 0.09ml for tUS, 0.01 ± 0.18ml for CT and −0.20 ± 0.32ml for MRI. These results align with our results in showing that tracked 3D US shows a smaller difference to MRI than 2D US. Furthermore, it shows the second largest mean volume error in volume estimation for MRI. This aligns with our observation of a probable continuous underestimation of our MRI segmentations.

PLOS ONE
Tracked 3D ultrasound and segmentation reduce variability in thyroid volumetry

Method evaluation
The current network was trained and evaluated on ultrasound images from one ultrasound machine, with slightly different imaging parameters (such as gain, and focus). Based on this data, the network can handle small changes in the ultrasound image. For more prominent changes in the imaging parameters (such as frequency, ultrasound transducer, speckle filtering, among others), resulting in more considerable changes in the ultrasound image, we believe that the CNN would have to be adjusted in its architecture and fine-tuned to increase its generalizability. Compared to the current clinical practice, the improved performance can be based on the nature of the 3D acquisition and the automatic segmentation. 3D US itself makes a difference compared to 2D US as it removes the variability of the probe placement with respect to the anatomy through volume compounding. The automatic segmentation allows for a complete thyroid segmentation, which provides a shape closer to the actual thyroid form than an ellipsoid approximation.

Limitations of this study
Only young and healthy volunteers were enrolled, simplifying the segmentation and volumetry tasks.The CNN was trained with 16 volunteers only, allowing for a good Dice score but leaving space for improvements. These can include cross-validation as well as training on 3D input data instead of 2D images. The latter one might allow for additional spatial information which could improve the network performance; yet, it would require a significantly larger dataset. Visual analysis also showed non-optimal segmentation results in few scans (Fig 5). The network seems to be sensitive to the echogenicity of the US images, resulting in additional areas outside of the thyroid to be segmented and areas inside the thyroid to be not segmented. Furthermore, the network was trained on single lobes. This can introduce a potential deviation in the volume due to overlapping or non-segmented parts of the isthmus. It also has to be noted that not all MDs have used the 3D system before. A longer training could therefore increase the scanning quality.

Conclusion
Our study shows that 3D US outperforms 2D US significantly. This was consequently seen in both groups 1 and 2 denoting "training" and "new" data. Regarding the intraobserver variability, 3D US outperforms 2D US with 2 out of 3 MDs. In the interobserver variability 3D US increases the similarity of acquisition scans between two MDs in 2 out of 3 cases. The results also show that the improvement with 3D US is the biggest for the least experienced MD. We show that the 3D US volumetry is more accurate than 2D US using MRI as ground truth. The tracked 3D US acquisition is significantly faster as well. Therefore, the combination of tracked 3D US and automatic segmentation is not only feasible but offers a more accurate alternative to conventional US imaging. The labelled thyroid 3D US dataset will be freely available helping further projects in deep learning for thyroid volumetry within the research community. We suggest conducting similar studies on patients with thyroid morbidities to stratify the clinical benefit of our framework.